Results 1 - 20 of 156
1.
J R Soc Interface ; 21(212): 20230647, 2024 03.
Article in English | MEDLINE | ID: mdl-38503341

ABSTRACT

Cultural processes of change bear many resemblances to biological evolution. The underlying units of non-biological evolution have, however, remained elusive, especially in the domain of music. Here, we introduce a general framework to jointly identify underlying units and their associated evolutionary processes. We model musical styles and principles of organization in dimensions such as harmony and form as following an evolutionary process. Furthermore, we propose that such processes can be identified by extracting latent evolutionary signatures from musical corpora, analogously to identifying mutational signatures in genomics. These signatures provide a latent embedding for each song or musical piece. We develop a deep generative architecture for our model, which can be viewed as a type of variational autoencoder with an evolutionary prior constraining the latent space; specifically, the embeddings for each song are tied together via an energy-based prior, which encourages songs close in evolutionary space to share similar representations. As illustration, we analyse songs from the McGill Billboard dataset. We find frequent chord transitions and formal repetition schemes and identify latent evolutionary signatures related to these features. Finally, we show that the latent evolutionary representations learned by our model outperform non-evolutionary representations in such tasks as period and genre prediction.


Subjects
Cultural Evolution , Music , Genomics
2.
Genome Res ; 2023 Dec 14.
Article in English | MEDLINE | ID: mdl-38097386

ABSTRACT

Single nucleotide polymorphisms (SNPs) from omics data create a reidentification risk for individuals and their relatives. Although the ability of thousands of SNPs (especially rare ones) to identify individuals has been repeatedly shown, the availability of small sets of noisy genotypes, from environmental DNA samples or functional genomics data, motivated us to quantify their informativeness. We present a computational tool suite, termed Privacy Leakage by Inference across Genotypic HMM Trajectories (PLIGHT), using population-genetics-based hidden Markov models (HMMs) of recombination and mutation to find piecewise alignment of small, noisy SNP sets to reference haplotype databases. We explore cases in which query individuals are either known to be in the database, or not, and consider several genotype queries, including those from environmental sample swabs from known individuals and from simulated "mosaics" (two-individual composites). Using PLIGHT on a database with ∼5000 haplotypes, we find for common, noise-free SNPs that only ten are sufficient to identify individuals, ∼20 can identify both components in two-individual mosaics, and 20-30 can identify first-order relatives. Using noisy environmental-sample-derived SNPs, PLIGHT identifies individuals in a database using ∼30 SNPs. Even when the individuals are not in the database, local genotype matches allow for some phenotypic information leakage based on coarse-grained SNP imputation. Finally, by quantifying privacy leakage from sparse SNP sets, PLIGHT helps determine the value of selectively sanitizing released SNPs without explicit assumptions about population membership or allele frequency. To make this practical, we provide a sanitization tool to remove the most identifying SNPs from genomic data.

3.
bioRxiv ; 2023 May 16.
Article in English | MEDLINE | ID: mdl-37292896

ABSTRACT

The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3' end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts. We sequenced 264 LR-RNA-seq PacBio libraries totaling over 1 billion circular consensus reads (CCS) for 81 unique human and mouse samples. We detect at least one full-length transcript from 87.7% of annotated human protein coding genes and a total of 200,000 full-length transcripts, 40% of which have novel exon junction chains. To capture and compute on the three sources of transcript structure diversity, we introduce a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript. Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3' processing are deployed across human tissues, with nearly half of multi-transcript protein coding genes showing a clear bias toward one of the three diversity mechanisms. Evaluated across samples, the predominantly expressed transcript changes for 74% of protein coding genes. In evolution, the human and mouse transcriptomes are globally similar in types of transcript structure diversity, yet among individual orthologous gene pairs, more than half (57.8%) show substantial differences in mechanism of diversification in matching tissues. This initial large-scale survey of human and mouse long-read transcriptomes provides a foundation for further analyses of alternative transcript usage, and is complemented by short-read and microRNA data on the same samples and by epigenome data elsewhere in the ENCODE4 collection.

4.
Sleep Med ; 107: 212-218, 2023 07.
Article in English | MEDLINE | ID: mdl-37235891

ABSTRACT

Public health officials and clinicians routinely advise social media users to avoid nighttime social media use due to the perception that this delays the onset of sleep and predisposes to the health risks of insufficient sleep. With some exceptions, the evidence behind this advice mostly derives from surveys identifying an association between self-reported social media usage and self-reported sleep patterns. In principle, these associations could alternatively be explained by users turning to social media to pass the time when they are otherwise having difficulty sleeping, or by individual differences that draw some people to frequent social media use, or by offline activities that overlap with both social media use and delayed sleep. To attempt to distinguish among these explanations, we leveraged estimated bedtimes from 44,000 Reddit users reported in a recent study and their 120 million posts to test whether the relationship between sleep and social media has properties suggestive of a causal relationship. We find that users are especially likely to be active on Reddit after their bedtime (and therefore awake) on nights that they posted to Reddit shortly before bedtime, especially if they posted multiple times or in high-engagement forums that night. Overall, this study lends additional support to the notion that there likely is some causal effect of evening social media use on delayed sleep onset.


Subjects
Circadian Rhythm Sleep Disorders , Social Media , Adult , Female , Humans , Male , Young Adult , Circadian Rhythm , Prevalence , Self Report , Circadian Rhythm Sleep Disorders/epidemiology , Time Factors
5.
Cell Genom ; 3(5): 100303, 2023 May 10.
Article in English | MEDLINE | ID: mdl-37228754

ABSTRACT

Although the role of RNA binding proteins (RBPs) in extracellular RNA (exRNA) biology is well established, their exRNA cargo and distribution across biofluids are largely unknown. To address this gap, we extend the exRNA Atlas resource by mapping exRNAs carried by extracellular RBPs (exRBPs). This map was developed through an integrative analysis of ENCODE enhanced crosslinking and immunoprecipitation (eCLIP) data (150 RBPs) and human exRNA profiles (6,930 samples). Computational analysis and experimental validation identified exRBPs in plasma, serum, saliva, urine, cerebrospinal fluid, and cell-culture-conditioned medium. exRBPs carry exRNA transcripts from small non-coding RNA biotypes, including microRNA (miRNA), piRNA, tRNA, small nuclear RNA (snRNA), small nucleolar RNA (snoRNA), Y RNA, and lncRNA, as well as protein-coding mRNA fragments. Computational deconvolution of exRBP RNA cargo reveals associations of exRBPs with extracellular vesicles, lipoproteins, and ribonucleoproteins across human biofluids. Overall, we mapped the distribution of exRBPs across human biofluids, presenting a resource for the community.

6.
JMIR Form Res ; 7: e38112, 2023 Jan 17.
Article in English | MEDLINE | ID: mdl-36649054

ABSTRACT

BACKGROUND: Individuals with later bedtimes have an increased risk of difficulties with mood and substances. To investigate the causes and consequences of late bedtimes and other sleep patterns, researchers are exploring social media as a data source. Pioneering studies inferred sleep patterns directly from social media data. While innovative, these efforts are variously unscalable, context dependent, confined to specific sleep parameters, or rest on untested assumptions, and none of the reviewed studies apply to the popular Reddit platform or release software to the research community. OBJECTIVE: This study builds on this prior work. We estimate the bedtimes of Reddit users from the time stamps of their posts, test inference validity against survey data, and release our model as an R package (The R Foundation). METHODS: We included 159 sufficiently active Reddit users with known time zones and known, nonanomalous bedtimes, together with the time stamps of their 2.1 million posts. The model's form was chosen by visualizing the aggregate distribution of the timing of users' posts relative to their reported bedtimes. The chosen model represents a user's frequency of Reddit posting by time of day, with a flat portion before bedtime and a quadratic depletion that begins near the user's bedtime, with parameters fitted to the data. This model estimates the bedtimes of individual Reddit users from the time stamps of their posts. Model performance is assessed through k-fold cross-validation. We then apply the model to estimate the bedtimes of 51,372 sufficiently active, nonbot Reddit users with known time zones from the time stamps of their 140 million posts. RESULTS: The Pearson correlation between expected and observed Reddit posting frequencies in our model was 0.997 on aggregate data. On average, posting starts declining 45 minutes before bedtime, reaches a nadir 4.75 hours after bedtime that is 87% lower than the daytime rate, and returns to baseline 10.25 hours after bedtime. The Pearson correlation between inferred and reported bedtimes for individual users was 0.61 (P<.001). In 90 of 159 cases (56.6%), our estimate was within 1 hour of the reported bedtime; 128 cases (80.5%) were within 2 hours. There was equivalent accuracy in hold-out sets versus training sets of k-fold cross-validation, arguing against overfitting. The model was more accurate than a random forest approach. CONCLUSIONS: We uncovered a simple, reproducible relationship between Reddit users' reported bedtimes and the time of day when high daytime posting rates transition to low nighttime posting rates. We captured this relationship in a model that estimates users' bedtimes from the time stamps of their posts. Limitations include applicability only to users who post frequently, the requirement for time zone data, and limits on generalizability. Nonetheless, it is a step forward for inferring the sleep parameters of social media users passively at scale. Our model and precomputed estimated bedtimes of 50,000 Reddit users are freely available.
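
The posting-rate shape described in this abstract (flat before bedtime, then a quadratic dip) can be illustrated with a minimal Python sketch. It is not the authors' R package or its API. It assumes the dip is a single symmetric parabola and plugs in the average values reported above: decline begins 0.75 hours before bedtime, baseline returns 10.25 hours after, and the nadir is 87% below the daytime rate. Under that assumption the parabola's vertex falls at 4.75 hours after bedtime, which matches the reported nadir. The function name and the normalization of the daytime rate to 1.0 are hypothetical.

```python
def relative_posting_rate(hours_after_bedtime: float) -> float:
    """Posting rate relative to the daytime baseline (baseline = 1.0).

    Illustrative only: a flat baseline with one symmetric quadratic dip,
    parameterized by the averages reported in the abstract.
    """
    dip_start = -0.75   # decline begins 45 minutes before bedtime
    dip_end = 10.25     # rate returns to baseline 10.25 hours after bedtime
    depth = 0.87        # nadir is 87% lower than the daytime rate

    # Outside the dip the rate stays at the flat daytime baseline.
    if hours_after_bedtime <= dip_start or hours_after_bedtime >= dip_end:
        return 1.0

    center = (dip_start + dip_end) / 2       # vertex of the parabola
    half_width = (dip_end - dip_start) / 2   # distance from vertex to either edge
    z = (hours_after_bedtime - center) / half_width
    # z = 0 at the nadir (rate = 1 - depth); z = +/-1 at the edges (rate = 1).
    return 1.0 - depth * (1.0 - z * z)


if __name__ == "__main__":
    for t in (-2.0, 0.0, 4.75, 8.0, 11.0):
        print(f"{t:+6.2f} h after bedtime: {relative_posting_rate(t):.3f}")
```

Estimating a bedtime from time stamps would then amount to finding the shift of this curve that best fits a user's observed hourly posting frequencies; the fitting procedure itself is not described in the abstract and is left out here.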

7.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36477833

ABSTRACT

MOTIVATION: While many quantum computing (QC) methods promise theoretical advantages over classical counterparts, quantum hardware remains limited. Exploiting near-term QC in computer-aided drug design (CADD) thus requires judicious partitioning between classical and quantum calculations. RESULTS: We present HypaCADD, a hybrid classical-quantum workflow for finding ligands binding to proteins, while accounting for genetic mutations. We explicitly identify modules of our drug-design workflow currently amenable to replacement by QC: non-intuitively, we identify the mutation-impact predictor as the best candidate. HypaCADD thus combines classical docking and molecular dynamics with quantum machine learning (QML) to infer the impact of mutations. We present a case study with the coronavirus (SARS-CoV-2) protease and associated mutants. We map a classical machine-learning module onto QC, using a neural network constructed from qubit-rotation gates. We have implemented this in simulation and on two commercial quantum computers. We find that the QML models can perform on par with, if not better than, classical baselines. In summary, HypaCADD offers a successful strategy for leveraging QC for CADD. AVAILABILITY AND IMPLEMENTATION: Jupyter Notebooks with Python code are freely available for academic use on GitHub: https://www.github.com/hypahub/hypacadd_notebook. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
COVID-19 , Software , Humans , Workflow , Computing Methodologies , Quantum Theory , SARS-CoV-2 , Drug Design , Molecular Dynamics Simulation
8.
Cell Rep ; 41(8): 111675, 2022 11 22.
Article in English | MEDLINE | ID: mdl-36417855

ABSTRACT

Many human diseases are caused by mutations in nuclear envelope (NE) proteins. How protein homeostasis and disease etiology are interconnected at the NE is poorly understood. Specifically, the identity of local ubiquitin ligases that facilitate ubiquitin-proteasome-dependent NE protein turnover is presently unknown. Here, we employ a short-lived, Lamin B receptor disease variant as a model substrate in a genetic screen to uncover key elements of NE protein turnover. We identify the ubiquitin-conjugating enzymes (E2s) Ube2G2 and Ube2D3, the membrane-resident ubiquitin ligases (E3s) RNF5 and HRD1, and the poorly understood protein TMEM33. RNF5, but not HRD1, requires TMEM33 both for efficient biosynthesis and function. Once synthesized, RNF5 responds dynamically to increased substrate levels at the NE by departing from the endoplasmic reticulum, where HRD1 remains confined. Thus, mammalian protein quality control machinery partitions between distinct cellular compartments to address locally changing substrate loads, establishing a robust cellular quality control system.


Subjects
Membrane Proteins , Ubiquitin-Protein Ligases , Animals , Humans , Ubiquitin-Protein Ligases/metabolism , Membrane Proteins/metabolism , Endoplasmic Reticulum/metabolism , Ubiquitin-Conjugating Enzymes/metabolism , Ubiquitin/metabolism , Mammals/metabolism
9.
Hum Mol Genet ; 31(R1): R114-R122, 2022 10 20.
Article in English | MEDLINE | ID: mdl-36083269

ABSTRACT

Every cell in the human body inherits a copy of the same genetic information. The three billion base pairs of DNA in the human genome, and the roughly 50 000 coding and non-coding genes they contain, must thus encode all the complexity of human development and cell and tissue type diversity. Differences in gene regulation, or the modulation of gene expression, enable individual cells to interpret the genome differently to carry out their specific functions. Here we discuss recent and ongoing efforts to build gene regulatory maps, which aim to characterize the regulatory roles of all sequences in a genome. Many researchers and consortia have identified such regulatory elements using functional assays and evolutionary analyses; we discuss the results, strengths and shortcomings of their approaches. We also discuss new techniques the field can leverage and emerging challenges it will face while striving to build gene regulatory maps of ever-increasing resolution and comprehensiveness.


Subjects
Gene Expression Regulation , Nucleic Acid Regulatory Sequences , Humans , Gene Expression Regulation/genetics , Human Genome/genetics , Chromosome Mapping , DNA/genetics
10.
iScience ; 25(8): 104653, 2022 Aug 19.
Article in English | MEDLINE | ID: mdl-35958027

ABSTRACT

The extracellular RNA communication consortium (ERCC) is an NIH-funded program aiming to promote the development of new technologies, resources, and knowledge about exRNAs and their carriers. After Phase 1 (2013-2018), Phase 2 of the program (ERCC2, 2019-2023) aims to fill critical gaps in knowledge and technology to enable rigorous and reproducible methods for separation and characterization of both bulk populations of exRNA carriers and single EVs. ERCC2 investigators are also developing new bioinformatic pipelines to promote data integration through the exRNA atlas database. ERCC2 has established several Working Groups (Resource Sharing, Reagent Development, Data Analysis and Coordination, Technology Development, nomenclature, and Scientific Outreach) to promote collaboration between ERCC2 members and the broader scientific community. We expect that ERCC2's current and future achievements will significantly improve our understanding of exRNA biology and the development of accurate and efficient exRNA-based diagnostic, prognostic, and theranostic biomarker assays.

12.
Nat Rev Genet ; 23(4): 245-258, 2022 04.
Article in English | MEDLINE | ID: mdl-34759381

ABSTRACT

The generation of functional genomics data by next-generation sequencing has increased greatly in the past decade. Broad sharing of these data is essential for research advancement but poses notable privacy challenges, some of which are analogous to those that occur when sharing genetic variant data. However, there are also unique privacy challenges that arise from cryptic information leakage during the processing and summarization of functional genomics data from raw reads to derived quantities, such as gene expression values. Here, we review these challenges and present potential solutions for mitigating privacy risks while allowing broad data dissemination and analysis.


Subjects
Genetic Privacy , Privacy , Genomics , High-Throughput Nucleotide Sequencing , Risk Assessment
14.
Sci Rep ; 11(1): 21705, 2021 11 04.
Article in English | MEDLINE | ID: mdl-34737331

ABSTRACT

RNA-seq has matured and become an important tool for studying RNA biology. Here we compared two RNA-seq (MGI DNBSEQ and Illumina NextSeq 500) and two microarray platforms (GeneChip Human Transcriptome Array 2.0 and Illumina Expression BeadChip) in healthy individuals administered recombinant human erythropoietin for transcriptome-wide quantification of differential gene expression. The results show that total RNA DNB-seq generated a multitude of target genes compared to other platforms. Pathway enrichment analyses revealed genes correlate to not only erythropoiesis and oxygen transport but also a wide range of other functions, such as tissue protection and immune regulation. This study provides a knowledge base of genes relevant to EPO biology through cross-platform comparisons and validation.


Subjects
Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , RNA Sequence Analysis/methods , Erythropoiesis/genetics , Erythropoietin/genetics , Gene Expression/genetics , High-Throughput Nucleotide Sequencing/methods , Humans , RNA/genetics , RNA-Seq/methods , Transcriptome/genetics
15.
Genome Biol ; 22(1): 287, 2021 10 07.
Article in English | MEDLINE | ID: mdl-34620211

ABSTRACT

BACKGROUND: The diversity of genomic alterations in cancer poses challenges to fully understanding the etiologies of the disease. Recent interest in infrequent mutations, in genes that reside in the "long tail" of the mutational distribution, uncovered new genes with significant implications in cancer development. The study of cancer-relevant genes often requires integrative approaches pooling together multiple types of biological data. Network propagation methods demonstrate high efficacy in achieving this integration. Yet, the majority of these methods focus their assessment on detecting known cancer genes or identifying altered subnetworks. In this paper, we introduce a network propagation approach that entirely focuses on prioritizing long tail genes with potential functional impact on cancer development. RESULTS: We identify sets of often overlooked, rarely to moderately mutated genes whose biological interactions significantly propel their mutation-frequency-based rank upwards during propagation in 17 cancer types. We call these sets "upward mobility genes" and hypothesize that their significant rank improvement indicates functional importance. We report new cancer-pathway associations based on upward mobility genes that are not previously identified using driver genes alone, validate their role in cancer cell survival in vitro using extensive genome-wide RNAi and CRISPR data repositories, and further conduct in vitro functional screenings resulting in the validation of 18 previously unreported genes. CONCLUSION: Our analysis extends the spectrum of cancer-relevant genes and identifies novel potential therapeutic targets.
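
The idea of a gene's rank being "propelled upwards" by propagation can be illustrated with a toy Python sketch. The abstract does not specify the propagation algorithm, so this uses a generic random walk with restart as a stand-in, and it is not the authors' method. The four-gene star network, the mutation frequencies, the restart probability, and all function names are hypothetical. A rarely mutated hub gene starts last in the mutation-frequency ranking and moves to first after its neighbors' scores diffuse to it.

```python
def propagate(adjacency, initial_scores, restart=0.5, iterations=100):
    """Random walk with restart on an undirected graph (generic stand-in)."""
    scores = dict(initial_scores)
    for _ in range(iterations):
        updated = {}
        for node, neighbors in adjacency.items():
            # Each neighbor splits its current score evenly among its own neighbors.
            incoming = sum(scores[nbr] / len(adjacency[nbr]) for nbr in neighbors)
            # Blend the diffused score with the node's original (restart) score.
            updated[node] = restart * initial_scores[node] + (1 - restart) * incoming
        scores = updated
    return scores


def rank_of(scores, gene):
    """1-based rank of a gene when genes are sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(gene) + 1


# Hypothetical toy network: hub gene "A" interacts with "B", "C", and "D".
adjacency = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}
# Hypothetical mutation frequencies: the hub is the least frequently mutated gene.
mutation_frequency = {"A": 0.1, "B": 0.4, "C": 0.3, "D": 0.2}

if __name__ == "__main__":
    propagated = propagate(adjacency, mutation_frequency)
    print("rank of A before:", rank_of(mutation_frequency, "A"))
    print("rank of A after: ", rank_of(propagated, "A"))
```

In this framing, the rank difference before and after propagation is the quantity of interest; deciding whether such a jump is statistically significant would need a null model, which the sketch does not attempt.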


Subjects
Neoplasm Genes , Neoplasms/genetics , Cell Survival , Neoplasm Genes/drug effects , Humans , Mutation , Neoplasms/metabolism , Protein Interaction Mapping
16.
Sports Med ; 51(11): 2237-2250, 2021 11.
Article in English | MEDLINE | ID: mdl-34468950

ABSTRACT

Millions of consumer sport and fitness wearables (CSFWs) are used worldwide, and millions of datapoints are generated by each device. Moreover, these numbers are rapidly growing, and they contain a heterogeneity of devices, data types, and contexts for data collection. Companies and consumers would benefit from guiding standards on device quality and data formats. To address this growing need, we convened a virtual panel of industry and academic stakeholders, and this manuscript summarizes the outcomes of the discussion. Our objectives were to identify (1) key facilitators of and barriers to participation by CSFW manufacturers in guiding standards and (2) stakeholder priorities. The venues were the Yale Center for Biomedical Data Science Digital Health Monthly Seminar Series (62 participants) and the New England Chapter of the American College of Sports Medicine Annual Meeting (59 participants). In the discussion, stakeholders outlined both facilitators of (e.g., commercial return on investment in device quality, lucrative research partnerships, and transparent and multilevel evaluation of device quality) and barriers (e.g., competitive advantage conflict, lack of flexibility in previously developed devices) to participation in guiding standards. There was general agreement to adopt Keadle et al.'s standard pathway for testing devices (i.e., benchtop, laboratory, field-based, implementation) without consensus on the prioritization of these steps. Overall, there was enthusiasm not to add prescriptive or regulatory steps, but instead create a networking hub that connects companies to consumers and researchers for flexible guidance navigating the heterogeneity, multi-tiered development, dynamicity, and nebulousness of the CSFW field.


Subjects
Sports Medicine , Sports , Wearable Electronic Devices , Consensus , Exercise , Humans
17.
J Biol Chem ; 297(2): 100937, 2021 08.
Article in English | MEDLINE | ID: mdl-34224731

ABSTRACT

The endoplasmic reticulum (ER) is a membrane-bound organelle responsible for protein folding, lipid synthesis, and calcium homeostasis. Maintenance of ER structural integrity is crucial for proper function, but much remains to be learned about the molecular players involved. To identify proteins that support the structure of the ER, we performed a proteomic screen and identified nodal modulator (NOMO), a widely conserved type I transmembrane protein of unknown function, with three nearly identical orthologs specified in the human genome. We found that overexpression of NOMO1 imposes a sheet morphology on the ER, whereas depletion of NOMO1 and its orthologs causes a collapse of ER morphology concomitant with the formation of membrane-delineated holes in the ER network positive for the lysosomal marker lysosomal-associated protein 1. In addition, the levels of key players of autophagy including microtubule-associated protein light chain 3 and autophagy cargo receptor p62/sequestosome 1 strongly increase upon NOMO depletion. In vitro reconstitution of NOMO1 revealed a "beads on a string" structure likely representing consecutive immunoglobulin-like domains. Extending NOMO1 by insertion of additional immunoglobulin folds results in a correlative increase in the ER intermembrane distance. Based on these observations and a genetic epistasis analysis including the known ER-shaping proteins Atlastin2 and Climp63, we propose a role for NOMO1 in the functional network of ER-shaping proteins.


Subjects
Endoplasmic Reticulum , Proteomics , Sequestosome-1 Protein , Autophagy , Endoplasmic Reticulum Stress , Homeostasis , Humans , Lysosomes/metabolism
18.
Cancer Res ; 81(16): 4194-4204, 2021 08 15.
Article in English | MEDLINE | ID: mdl-34045189

ABSTRACT

STK11 (liver kinase B1, LKB1) is the fourth most frequently mutated gene in lung adenocarcinoma, with loss of function observed in up to 30% of all cases. Our previous work identified a 16-gene signature for LKB1 loss of function through mutational and nonmutational mechanisms. In this study, we applied this genetic signature to The Cancer Genome Atlas (TCGA) lung adenocarcinoma samples and discovered a novel association between LKB1 loss and widespread DNA demethylation. LKB1-deficient tumors showed depletion of S-adenosyl-methionine (SAM-e), which is the primary substrate for DNMT1 activity. Lower methylation following LKB1 loss involved repetitive elements (RE) and altered RE transcription, as well as decreased sensitivity to azacytidine. Demethylated CpGs were enriched for FOXA family consensus binding sites, and nuclear expression, localization, and turnover of FOXA was dependent upon LKB1. Overall, these findings demonstrate that a large number of lung adenocarcinomas exhibit global hypomethylation driven by LKB1 loss, which has implications for both epigenetic therapy and immunotherapy in these cancers. SIGNIFICANCE: Lung adenocarcinomas with LKB1 loss demonstrate global genomic hypomethylation associated with depletion of SAM-e, reduced expression of DNMT1, and increased transcription of repetitive elements.


Subjects
AMP-Activated Protein Kinase Kinases/physiology , Adenocarcinoma/genetics , DNA Methylation , Lung Neoplasms/genetics , S-Adenosylmethionine/metabolism , AMP-Activated Protein Kinase Kinases/genetics , Adenocarcinoma/metabolism , Cell Line , Cell Survival , Cluster Analysis , Computational Biology , CpG Islands , Genetic Databases , Genetic Epigenesis , ras Genes , Humans , Lung Neoplasms/metabolism , Methionine , Mutation , Oligonucleotide Array Sequence Analysis , Proto-Oncogene Proteins p21(ras)/genetics , Nucleic Acid Repetitive Sequences
19.
Am J Hum Genet ; 108(5): 919-928, 2021 05 06.
Article in English | MEDLINE | ID: mdl-33789087

ABSTRACT

Virtually all genome sequencing efforts in national biobanks, complex and Mendelian disease programs, and medical genetic initiatives are reliant upon short-read whole-genome sequencing (srWGS), which presents challenges for the detection of structural variants (SVs) relative to emerging long-read WGS (lrWGS) technologies. Given this ubiquity of srWGS in large-scale genomics initiatives, we sought to establish expectations for routine SV detection from this data type by comparison with lrWGS assembly, as well as to quantify the genomic properties and added value of SVs uniquely accessible to each technology. Analyses from the Human Genome Structural Variation Consortium (HGSVC) of three families captured ~11,000 SVs per genome from srWGS and ~25,000 SVs per genome from lrWGS assembly. Detection power and precision for SV discovery varied dramatically by genomic context and variant class: 9.7% of the current GRCh38 reference is defined by segmental duplication (SD) and simple repeat (SR), yet 91.4% of deletions that were specifically discovered by lrWGS localized to these regions. Across the remaining 90.3% of reference sequence, we observed extremely high (93.8%) concordance between technologies for deletions in these datasets. In contrast, lrWGS was superior for detection of insertions across all genomic contexts. Given that non-SD/SR sequences encompass 95.9% of currently annotated disease-associated exons, improved sensitivity from lrWGS to discover novel pathogenic deletions in these currently interpretable genomic regions is likely to be incremental. However, these analyses highlight the considerable added value of assembly-based lrWGS to create new catalogs of insertions and transposable elements, as well as disease-associated repeat expansions in genomic sequences that were previously recalcitrant to routine assessment.


Subjects
Human Genome/genetics , Genomic Structural Variation , Genomics/methods , Goals , Whole Genome Sequencing/methods , Whole Genome Sequencing/standards , DNA Copy Number Variations , Exons/genetics , Humans , Research Design , Genomic Segmental Duplications , Sequence Alignment
20.
Science ; 372(6537)2021 04 02.
Article in English | MEDLINE | ID: mdl-33632895

ABSTRACT

Long-read and strand-specific sequencing technologies together facilitate the de novo assembly of high-quality haplotype-resolved human genomes without parent-child trio data. We present 64 assembled haplotypes from 32 diverse human genomes. These highly contiguous haplotype assemblies (average minimum contig length needed to cover 50% of the genome: 26 million base pairs) integrate all forms of genetic variation, even across complex loci. We identified 107,590 structural variants (SVs), of which 68% were not discovered with short-read sequencing, and 278 SV hotspots (spanning megabases of gene-rich sequence). We characterized 130 of the most active mobile element source elements and found that 63% of all SVs arise through homology-mediated mechanisms. This resource enables reliable graph-based genotyping from short reads of up to 50,340 SVs, resulting in the identification of 1526 expression quantitative trait loci as well as SV candidates for adaptive selection within the human population.
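
The contiguity statistic paraphrased in this abstract ("minimum contig length needed to cover 50% of the genome") is commonly called N50. A minimal Python sketch of how it is typically computed follows. The function name and the optional `genome_size` parameter are hypothetical. By default the sketch measures 50% against the total assembled length; passing an estimated genome size instead gives the related NG50 variant. The abstract does not state which denominator the authors used.

```python
def n50(contig_lengths, genome_size=None):
    """Length of the shortest contig in the smallest set of longest contigs
    that together cover at least half of the target length.

    The target is the total assembly length unless genome_size is given.
    """
    total = genome_size if genome_size is not None else sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating covered bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contigs do not cover half of the target length")


if __name__ == "__main__":
    # Toy contig lengths (arbitrary units), total 20: the two longest
    # contigs (6 + 5 = 11) already cover half, so the result is 5.
    print(n50([2, 3, 4, 5, 6]))
```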


Subjects
Genetic Variation , Human Genome , Haplotypes , Female , Genotype , High-Throughput Nucleotide Sequencing , Humans , INDEL Mutation , Interspersed Repetitive Sequences , Male , Population Groups/genetics , Quantitative Trait Loci , Retroelements , DNA Sequence Analysis , Sequence Inversion , Whole Genome Sequencing